Final Project (Group 14)

Import Packages

import pandas as pd
import os
import geopandas as gpd
import matplotlib.pyplot as plt
import altair as alt

Import Illegal Pets Data and Education Attainment Data in NYC

# Illegal Pets Data
illegal_pets = pd.read_csv("Illegal Pets/illegal_animals_kept_as_pets_20241116.csv")

# Education Attainment Data
education = pd.read_csv("Education Attainment/ACSST5Y2010.S1501-Data.csv")

Combine all education data from 2010 to 2022

# Directory where all CSV files are stored
directory = "Education Attainment"

# Collect all file paths in the directory that match the naming pattern
csv_files = [os.path.join(directory, file) for file in os.listdir(directory) if file.endswith(".csv")]

# List to hold individual DataFrames
dataframes = []

# Loop through each file, read it into a DataFrame, and append it to the list
for file in csv_files:
    # Extract the year from the file name (e.g., "ACSST5Y2010" -> "2010")
    year = os.path.basename(file).split("ACSST5Y")[1][:4]
    df = pd.read_csv(file)
    df['YEAR'] = year
    
    # Append the modified DataFrame to the list
    dataframes.append(df)

# Concatenate all DataFrames into one
combined_df = pd.concat(dataframes, ignore_index=True)

# Sort the combined DataFrame by 'GEO_ID' in descending order
sorted_df = combined_df.sort_values(by='GEO_ID', ascending=False)

Anayze meanings of variables

Each year’s dataset’s variable has different interpretation, so it it critical to firstly select the most relevant variable for futher analysis. In our project, we only select total estimate to simplify our anlysis without considering gender and earning. For example, S1501_C01_015’s definition has changed in 2018 from ‘percent of bachelor degree or higher’ to ‘total population of bachelor degreee or higher.’ Thus, we have to manually clean these messy variables.

# Select the top 13 rows (containing each years variable definition) to analyze vairbales meaning
top_13_rows = sorted_df.head(13)

# Identify columns where any cell contains "female", "male", or "margin"
columns_to_drop = top_13_rows.columns[
    top_13_rows.apply(lambda col: col.astype(str).str.contains(r"female|male|margin|nan|earning|race", case=False, na=False).any())
]

# Drop the identified columns
top_13_rows_f = top_13_rows.drop(columns=columns_to_drop)
top_13_rows_f = top_13_rows_f.dropna(axis=1, how='all')

Select necessery variables

# Cleaning Education Attainment Data
# Rename selected columns for better readability
column_rename_map = {
    "GEO_ID": "geo_id",  # Geographic identifier
    "NAME": "area_name",  # Geographic area name
    "YEAR": "year",  # Year of data
    "S1501_C01_006E": "pop_25_plus",  # Population 25 years and over
    "S1501_C01_007E": "pop_25_less_9th",  # 25 years and over: Less than 9th grade
    "S1501_C01_009E": "pop_25_hs_grad",  # 25 years and over: High school graduate
    "S1501_C01_013E": "pop_25_bach_plus",  # 25 years and over: Bachelor's degree or higher
    "S1501_C01_019E": "pop_25_34_bach_plus",  # 25-34: Bachelor's degree or higher
    "S1501_C01_021E": "pop_45_64_bach_plus"  # 45-64: Bachelor's degree or higher
}

# Create selected_columns based on column_rename_map keys
selected_columns = list(column_rename_map.keys())

# Filter the combined DataFrame to only keep the selected columns
filtered_df = combined_df[selected_columns]

# Rename the columns
filtered_df.rename(columns=column_rename_map, inplace=True)

# Remove rows where GEO_ID equals "Geographic Area Name"
edu_cleaned = filtered_df[filtered_df["geo_id"] != "Geographic Area Name"]

# Adjust variables to accommodate different years' approaches to estimate (volume or percentage)
# Transform year 2010 to year 2014's 'pop_25_less_9th' variable to be the result between 'pop_25_less_9th' and 'pop_25_plus'
for year in range(2010, 2015):
    edu_cleaned.loc[edu_cleaned['year'] == str(year), 'pop_25_less_9th'] = pd.to_numeric(edu_cleaned['pop_25_less_9th'], errors='coerce') * pd.to_numeric(edu_cleaned['pop_25_plus'], errors='coerce') / 100
    edu_cleaned.loc[edu_cleaned['year'] == str(year), 'pop_25_hs_grad'] = pd.to_numeric(edu_cleaned['pop_25_hs_grad'], errors='coerce') * pd.to_numeric(edu_cleaned['pop_25_plus'], errors='coerce') / 100
    edu_cleaned.loc[edu_cleaned['year'] == str(year), 'pop_25_bach_plus'] = pd.to_numeric(edu_cleaned['pop_25_bach_plus'], errors='coerce') * pd.to_numeric(edu_cleaned['pop_25_plus'], errors='coerce') / 100

# Cleaning Illegal Pets Data
# List of columns to drop
columns_to_drop = [
    'Agency', 'Agency Name', 'Complaint Type', 'Cross Street 1', 'Cross Street 2',
    'Intersection Street 1', 'Intersection Street 2', 'City', 'Landmark',
    'Facility Type', 'Community Board', 'Park Facility Name', 'Vehicle Type',
    'Taxi Company Borough', 'Taxi Pick Up Location', 'Bridge Highway Name',
    'Bridge Highway Direction', 'Road Ramp', 'Bridge Highway Segment'
]

# Drop the specified columns
illegal_pets = illegal_pets.drop(columns=columns_to_drop, errors='ignore')
/var/folders/k2/prgbv7z97knbd104r93pncfc0000gp/T/ipykernel_12590/2495111871.py:22: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Standardization of County Name

# Exam unqiue values in both dataframes for furthern merging
print(edu_cleaned['area_name'].unique())

# Create county name list in NYC
nyc_counties = [
    "New York County, New York",  # Manhattan
    "Kings County, New York",     # Brooklyn
    "Queens County, New York",    # Queens
    "Bronx County, New York",     # The Bronx
    "Richmond County, New York"   # Staten Island
]

# Select counties in NYC and filter them
edu_nyc = edu_cleaned[edu_cleaned['area_name'].isin(nyc_counties)]

# Mapping of county names to simplified borough names
county_replacement_map = {
    "New York County, New York": "New York",
    "Kings County, New York": "Kings",
    "Queens County, New York": "Queens",
    "Bronx County, New York": "The Bronx",
    "Richmond County, New York": "Staten Island"
}

# Replace county names in the 'name' column using the map
edu_nyc['area_name'] = edu_nyc['area_name'].replace(county_replacement_map)

# Exam unqiue values in both dataframes for furthern merging
print(illegal_pets['Borough'].unique())

# Mapping Borough
borough_to_county = {
    "BROOKLYN": "Kings",
    "STATEN ISLAND": "Staten Island",
    "QUEENS": "Queens",
    "MANHATTAN": "New York",
    "BRONX": "The Bronx"
}

# Map the 'Borough' column to new 'County' values
illegal_pets['area_name'] = illegal_pets['Borough'].map(borough_to_county)

print(illegal_pets['area_name'].unique())
print(edu_nyc['area_name'].unique())
['Geographic Area Name' 'New York' 'Albany County, New York'
 'Allegany County, New York' 'Bronx County, New York'
 'Broome County, New York' 'Cattaraugus County, New York'
 'Cayuga County, New York' 'Chautauqua County, New York'
 'Chemung County, New York' 'Chenango County, New York'
 'Clinton County, New York' 'Columbia County, New York'
 'Cortland County, New York' 'Delaware County, New York'
 'Dutchess County, New York' 'Erie County, New York'
 'Essex County, New York' 'Franklin County, New York'
 'Fulton County, New York' 'Genesee County, New York'
 'Greene County, New York' 'Hamilton County, New York'
 'Herkimer County, New York' 'Jefferson County, New York'
 'Kings County, New York' 'Lewis County, New York'
 'Livingston County, New York' 'Madison County, New York'
 'Monroe County, New York' 'Montgomery County, New York'
 'Nassau County, New York' 'New York County, New York'
 'Niagara County, New York' 'Oneida County, New York'
 'Onondaga County, New York' 'Ontario County, New York'
 'Orange County, New York' 'Orleans County, New York'
 'Oswego County, New York' 'Otsego County, New York'
 'Putnam County, New York' 'Queens County, New York'
 'Rensselaer County, New York' 'Richmond County, New York'
 'Rockland County, New York' 'St. Lawrence County, New York'
 'Saratoga County, New York' 'Schenectady County, New York'
 'Schoharie County, New York' 'Schuyler County, New York'
 'Seneca County, New York' 'Steuben County, New York'
 'Suffolk County, New York' 'Sullivan County, New York'
 'Tioga County, New York' 'Tompkins County, New York'
 'Ulster County, New York' 'Warren County, New York'
 'Washington County, New York' 'Wayne County, New York'
 'Westchester County, New York' 'Wyoming County, New York'
 'Yates County, New York']
['BROOKLYN' 'STATEN ISLAND' 'QUEENS' 'MANHATTAN' 'BRONX']
['Kings' 'Staten Island' 'Queens' 'New York' 'The Bronx']
['The Bronx' 'Kings' 'New York' 'Queens' 'Staten Island']
/var/folders/k2/prgbv7z97knbd104r93pncfc0000gp/T/ipykernel_12590/2301969997.py:26: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Standardization Time Variable

# Format both columns as strings
illegal_pets['year'] = pd.to_datetime(illegal_pets['Created Date'], errors='coerce').dt.strftime('%Y')
edu_nyc['year'] = pd.to_datetime(edu_nyc['year'], errors='coerce').dt.strftime('%Y')
/var/folders/k2/prgbv7z97knbd104r93pncfc0000gp/T/ipykernel_12590/3968184602.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Merging

# Drop 2023 and 2024 because education data does not includes these years
illegal_pets = illegal_pets[~illegal_pets['year'].isin(['2023', '2024'])]

# Merge based on year and area_name (County name)
merged_df = pd.merge(illegal_pets, edu_nyc, on=['year', 'area_name'], how='left')
print(merged_df.head())

# Save the merged DataFrame to a CSV file
merged_df.to_csv("final_education_illegal_pets.csv", index=False)
   Unique Key            Created Date Closed Date Descriptor Location Type  \
0    50293012  04/16/2021 08:11:36 AM         NaN    Rooster     Residence   
1    50959265  06/21/2021 08:22:44 PM         NaN      Other     Residence   
2    53173404  01/25/2022 02:59:43 AM         NaN    Rooster     Residence   
3    54820319  07/17/2022 09:46:58 AM         NaN    Rooster     Residence   
4    50293717  04/16/2021 08:05:30 AM         NaN    Rooster     Residence   

   Incident Zip         Incident Address       Street Name Address Type  \
0       11234.0         2733 MILL AVENUE       MILL AVENUE      ADDRESS   
1       10314.0           SUFFOLK AVENUE    SUFFOLK AVENUE    BLOCKFACE   
2       11385.0    59-29 PALMETTO STREET   PALMETTO STREET      ADDRESS   
3       11367.0  144-18 MELBOURNE AVENUE  MELBOURNE AVENUE      ADDRESS   
4       11234.0         2739 MILL AVENUE       MILL AVENUE      ADDRESS   

        Status  ...                                  Location      area_name  \
0  In Progress  ...  (40.607676602285885, -73.91426351943026)          Kings   
1  In Progress  ...                                       NaN  Staten Island   
2  In Progress  ...    (40.70732750183838, -73.9024888321992)         Queens   
3  In Progress  ...   (40.73427388136913, -73.82462533545319)         Queens   
4  In Progress  ...   (40.60760531818615, -73.91437165485081)          Kings   

   year          geo_id  pop_25_plus  pop_25_less_9th pop_25_hs_grad  \
0  2021  0500000US36047      1876692           158254         473782   
1  2021  0500000US36085       344389            17431         102905   
2  2022  0500000US36081      1709759           174351         444621   
3  2022  0500000US36081      1709759           174351         444621   
4  2021  0500000US36047      1876692           158254         473782   

   pop_25_bach_plus  pop_25_34_bach_plus pop_45_64_bach_plus  
0            298611               383827              184895  
1             47218                61972               27572  
2            229323               323773              132332  
3            229323               323773              132332  
4            298611               383827              184895  

[5 rows x 29 columns]

Load Map Data, drop NA and convert data type

# Load the GeoJSON file
geojson_path = 'nyc-zips.geojson'
nyc_zips = gpd.read_file(geojson_path)

# Drop rows with NaN ZIP codes or fill them with a placeholder
merged_df = merged_df.dropna(subset=['Incident Zip'])

# Convert ZIP codes to strings without decimals
merged_df['Incident Zip'] = merged_df['Incident Zip'].apply(lambda x: str(int(x)))

# Ensure both columns are strings
nyc_zips['postalCode'] = nyc_zips['postalCode'].astype(str)

Final Project Visualization

1. Bar chart: types of illegal pets in NYC

# 1. Bar chart of types of illegal pets
chart = alt.Chart(merged_df).mark_bar().encode(
    y=alt.Y('Descriptor:N', 
            sort='-x',  # Sort by count in descending order
            title='Pet Type'),
    x=alt.X('count():Q',
            title='Number of Incidents'),
    tooltip=['Descriptor', alt.Tooltip('count():Q', title='Count')]
).properties(
    title='Types of Illegal Pets in NYC',
    width=600,
    height=400
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
).configure_title(
    fontSize=16,
    anchor='middle'
)

chart

2. Pie chart of incidents across boroughs

# Calculate percentages and create pie chart with labels
total = len(merged_df)
df_with_pct = merged_df.assign(
    percentage=lambda x: (x.groupby('Borough')['Borough'].transform('count') / total * 100).round(1)
)

# Create a summary dataframe for the text labels
summary_df = (df_with_pct.groupby('Borough')
             .agg(count=('Borough', 'count'),
                  percentage=('percentage', 'first'))
             .reset_index()
             .assign(label=lambda x: x['Borough'] + ': ' + x['percentage'].astype(str) + '%'))

# Create pie chart with percentages
chart = alt.Chart(df_with_pct).mark_arc(outerRadius=180).encode(
    theta='count():Q',
    color=alt.Color('Borough:N', 
                   scale=alt.Scale(scheme='tableau10'),
                   legend=alt.Legend(title="Borough")),
    tooltip=['Borough:N', 
            alt.Tooltip('count():Q', title='Count'),
            alt.Tooltip('percentage:Q', title='Percentage', format='.1f')]
).properties(
    title={
        'text': 'Distribution of Illegal Pet Incidents by Borough',
        'fontSize': 16,
        'anchor': 'middle'
    },
    width=400,
    height=400
)

# Add text labels with percentages
text = alt.Chart(summary_df).mark_text(radius=120, size=11).encode(
    theta=alt.Theta('count:Q', stack=True),
    text='label:N'
)

# Combine chart and labels
final_chart = (chart + text).configure_view(
    strokeWidth=0
)

final_chart

3. Choropleth Map

import geopandas as gpd
import pandas as pd
import json

# Load GeoJSON file
geojson_path = 'nyc-zips.geojson'
nyc_zips = gpd.read_file(geojson_path)
merged_df = pd.read_csv("final_education_illegal_pets.csv")

# Data Preprocessing
merged_df = merged_df.dropna(subset=['Incident Zip'])
merged_df['Incident Zip'] = merged_df['Incident Zip'].apply(lambda x: str(int(x)))
nyc_zips['postalCode'] = nyc_zips['postalCode'].astype(str)

# Count complaints by ZIP code
pet_counts = merged_df.groupby('Incident Zip').size().reset_index(name='complaints_count')

# Merge counts with GeoDataFrame
nyc_zips = nyc_zips.merge(pet_counts, left_on='postalCode', right_on='Incident Zip', how='left')
nyc_zips['complaints_count'] = nyc_zips['complaints_count'].fillna(0)

# Convert GeoJSON to TopoJSON (Altair supports TopoJSON)
geojson_data = json.loads(nyc_zips.to_json())

# Create Altair Chart
choropleth = alt.Chart(alt.Data(values=geojson_data['features'])).mark_geoshape().encode(
    color=alt.Color('properties.complaints_count:Q', 
                    scale=alt.Scale(scheme='orangered'), 
                    title='Illegal Pet Complaints'),
    tooltip=[
        alt.Tooltip('properties.postalCode:N', title='ZIP Code'),
        alt.Tooltip('properties.complaints_count:Q', title='Complaints')
    ]
).transform_calculate(
    complaints_count='datum.properties.complaints_count'
).properties(
    title='Choropleth Map of Illegal Pet Complaints in NYC by ZIP Code',
    width=800,
    height=600
)

# Display the chart
choropleth

4. Line Plot: Different education level over time

import altair as alt

# Prepare the data (keeping the same data preparation steps)
education_levels_over_time = merged_df.groupby('year')[
    ['pop_25_less_9th', 'pop_25_hs_grad', 'pop_25_bach_plus']
].mean().reset_index()

education_levels_long = education_levels_over_time.melt(
    id_vars='year',
    value_vars=['pop_25_less_9th', 'pop_25_hs_grad', 'pop_25_bach_plus'],
    var_name='Education Level',
    value_name='Population'
)

education_level_mapping = {
    'pop_25_less_9th': 'Less than 9th Grade',
    'pop_25_hs_grad': 'High School Graduate',
    'pop_25_bach_plus': 'Bachelor\'s Degree or Higher'
}
education_levels_long['Education Level'] = education_levels_long['Education Level'].map(education_level_mapping)

# Create the line chart using Altair
chart = alt.Chart(education_levels_long).mark_line(point=True).encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('Population:Q', title='Average Population'),
    color=alt.Color('Education Level:N', title='Education Level'),
    tooltip=['year', 'Education Level', alt.Tooltip('Population:Q', format=',')]
).properties(
    title='Different Education Levels Over Time',
    width=800,
    height=400
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
).configure_title(
    fontSize=16
)

chart

5. Bar Plot: Different education level by districts

import altair as alt
import pandas as pd

# Aggregate education data by borough
education_levels_by_district = merged_df.groupby('Borough')[
    ['pop_25_less_9th', 'pop_25_hs_grad', 'pop_25_bach_plus']
].mean().reset_index()

# Melt the data for easier plotting with Altair
education_levels_long = education_levels_by_district.melt(
    id_vars='Borough',
    value_vars=['pop_25_less_9th', 'pop_25_hs_grad', 'pop_25_bach_plus'],
    var_name='Education Level',
    value_name='Population'
)

# Map the correct labels to the "Education Level" column
education_level_mapping = {
    'pop_25_less_9th': 'Less than 9th Grade',
    'pop_25_hs_grad': 'High School Graduate',
    'pop_25_bach_plus': 'Bachelor\'s Degree or Higher'
}
education_levels_long['Education Level'] = education_levels_long['Education Level'].map(education_level_mapping)

 
# Create Altair chart
chart = alt.Chart(education_levels_long).mark_bar().encode(
    x=alt.X('Borough:N', title='Borough'),
    y=alt.Y('Population:Q', title='Average Population'),
    color=alt.Color('Education Level:N', title='Education Level'),
    tooltip=['Borough', 'Education Level', 'Population']
).properties(
    title='Different Education Levels by District',
    width=800,
    height=400
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
).configure_title(
    fontSize=16,
    anchor='middle'
)

chart

6. Correlation Analysis

import requests
import geopandas as gpd
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import folium

# 2. Aggregate Data
# Count illegal pet incidents by district
incident_counts = merged_df.groupby('Borough')['Descriptor'].count().reset_index()
incident_counts.rename(columns={'Descriptor': 'Illegal_Pet_Incidents'}, inplace=True)

# Calculate mean education levels by district
education_data = merged_df.groupby('Borough')[
    ['pop_25_less_9th', 'pop_25_hs_grad', 'pop_25_bach_plus', 'pop_25_plus']
].mean().reset_index()

# Merge datasets
district_data = pd.merge(incident_counts, education_data, on='Borough')

# Calculate the percentage of each education level relative to the total population
district_data['pop_25_less_9th_%'] = (district_data['pop_25_less_9th'] / district_data['pop_25_plus']) * 100
district_data['pop_25_hs_grad_%'] = (district_data['pop_25_hs_grad'] / district_data['pop_25_plus']) * 100
district_data['pop_25_bach_plus_%'] = (district_data['pop_25_bach_plus'] / district_data['pop_25_plus']) * 100

# Calculate incident percentage
district_data['Incident_Percentage(%)'] = (district_data['Illegal_Pet_Incidents'] / district_data['pop_25_plus']) * 100

# 3. Perform Regression: Using all education levels (as percentages) as predictors
X = district_data[['pop_25_less_9th_%', 'pop_25_hs_grad_%', 'pop_25_bach_plus_%']]
y = district_data['Incident_Percentage(%)']

# Add a constant to the model (for the intercept)
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()
print(model.summary())

import altair as alt
import pandas as pd

# Use Altair to create regression plots for each feature
for feature, label in zip(
    ['pop_25_less_9th_%', 'pop_25_hs_grad_%', 'pop_25_bach_plus_%'],
    ["% Population with Less than 9th Grade Education",
     "% Population with High School Graduation",
     "% Population with Bachelor's Degree or Higher"]
):
    # Determine the min and max for the x-axis range with a little padding
    x_min = district_data[feature].min() * 0.9  # 10% padding below min
    x_max = district_data[feature].max() * 1.1  # 10% padding above max

    # Create Altair chart
    chart = alt.Chart(district_data).mark_point(filled=True, size=100).encode(
        x=alt.X(
            f'{feature}:Q', 
            title=label, 
            scale=alt.Scale(domain=[x_min, x_max])  # Dynamically set x-axis range
        ),
        y=alt.Y('Incident_Percentage(%):Q', title="Illegal Pet Incident Percentage (%)"),
        tooltip=[
            'Borough', 
            alt.Tooltip(f'{feature}:Q', title=label), 
            alt.Tooltip('Incident_Percentage(%):Q', title='Incident %')
        ]
    ).properties(
        title=f"Regression: Illegal Pet Incident Percentage vs. {label}",
        width=800,
        height=400
    )

    # Add regression line
    regression_line = chart.transform_regression(
        f'{feature}', 'Incident_Percentage(%)'
    ).mark_line(color='red')

    # Combine scatter plot and regression line
    combined_chart = chart + regression_line

    # Display the chart directly (no `.show()`)
    display(combined_chart.configure_title(
        fontSize=16,
        anchor='middle'
    ).configure_axis(
        labelFontSize=12,
        titleFontSize=14
    ))
                              OLS Regression Results                              
==================================================================================
Dep. Variable:     Incident_Percentage(%)   R-squared:                       0.999
Model:                                OLS   Adj. R-squared:                  0.997
Method:                     Least Squares   F-statistic:                     474.2
Date:                    Mon, 02 Dec 2024   Prob (F-statistic):             0.0337
Time:                            13:38:04   Log-Likelihood:                 27.108
No. Observations:                       5   AIC:                            -46.22
Df Residuals:                           1   BIC:                            -47.78
Df Model:                               3                                         
Covariance Type:                nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  1.9435      0.160     12.153      0.052      -0.088       3.975
pop_25_less_9th_%     -0.0466      0.003    -14.102      0.045      -0.089      -0.005
pop_25_hs_grad_%      -0.0336      0.003    -10.392      0.061      -0.075       0.007
pop_25_bach_plus_%    -0.0400      0.003    -12.152      0.052      -0.082       0.002
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.873
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.799
Skew:                          -0.967   Prob(JB):                        0.671
Kurtosis:                       2.698   Cond. No.                     4.58e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.58e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
/opt/anaconda3/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning:

omni_normtest is not valid with less than 8 observations; 5 samples were given.

Why Use This Method

This method normalizes incident counts by population size, allowing fair comparisons across boroughs. Regression analysis helps identify relationships between education levels as percentages and incident rates, while visualization makes these patterns clear.

By normalizing incident counts per 10,000 people, the method ensures fair comparisons across boroughs with different population sizes. Regression analysis quantifies the relationship between education levels and incident rates, providing insights into potential trends. Visualizations, such as scatter plots, make these relationships clear and accessible, helping to identify patterns and inform decision-making.

Analysis of the Graphs

The regression coefficients are very small, reflecting the rare nature of illegal pet ownership (0.2% to 1.0% of population). Less than 9th-grade education shows a negative coefficient (-0.0466), high school graduation shows a slight negative relationship (-0.0335), and bachelor’s degree shows a negative relationship (-0.0400).

The graphs appear to show strong negative relationships primarily due to scaling effects. While the x-axis spans a large range (showing education levels in thousands), the y-axis variation is minimal (0.2% to 1.0% incident rate). This disparity in scales makes even small changes appear more dramatic visually. For instance, the positive coefficient for high school graduation (3.576e-06) appears negative in the graph because the effect is so small relative to the axis scales.

These patterns, while statistically subtle, suggest that education levels have a minor influence on illegal pet ownership in NYC, though other factors likely play important roles in these relationships.

Limitations

The analysis is limited by the small sample size, with only five boroughs, which reduces the statistical power and makes it difficult to draw definitive conclusions. The use of borough-level aggregation may mask important variations within boroughs, potentially overlooking localized trends. Additionally, while the analysis identifies correlations between education levels and incident rates, it cannot establish causation.

Conclusion

Despite its limitations, this method offers a useful framework for understanding the relationship between educational attainment and illegal pet incidents in NYC. By normalizing data and employing regression analysis, the study provides insights that can inform policy decisions and highlight areas for further research. The visualizations effectively communicate complex data, making it accessible to a wider audience and supporting informed decision-making.

# 1. Download NYC borough boundaries GeoJSON
url = "https://data.cityofnewyork.us/api/geospatial/7t3b-ywvw?method=export&format=GeoJSON"
response = requests.get(url)

# Save the GeoJSON to a file
geojson_path = "nyc_boroughs.geojson"
with open(geojson_path, "wb") as file:
    file.write(response.content)

# Load the GeoJSON data
boroughs_gdf = gpd.read_file(geojson_path)
# Merge regression data with GeoJSON
map_data = boroughs_gdf.merge(district_data, left_on='boro_name', right_on='Borough', how='left')

# Create the map
m = folium.Map(location=[40.7128, -74.0060], zoom_start=10)

# Add choropleth for regression residuals
map_data['Residuals'] = model.resid  
folium.Choropleth(
    geo_data=map_data,
    data=map_data,
    columns=['boro_name', 'Residuals'],
    key_on='feature.properties.boro_name',
    fill_color='OrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Regression Residuals",
).add_to(m)

# Add district names as labels
for _, row in map_data.iterrows():
    folium.Marker(
        location=[row.geometry.centroid.y, row.geometry.centroid.x],  # Use centroid for label position
        popup=f"Borough: {row['boro_name']}\nResidual: {row['Residuals']:.2f}",
        icon=folium.DivIcon(html=f"<div style='font-size: 12px; color: black;'>{row['boro_name']}</div>")
    ).add_to(m)

# Save and display the map
m.save("regression_map_with_labels.html")
print("Map saved as 'regression_map_with_labels.html'. Open this file to view it.")
Map saved as 'regression_map_with_labels.html'. Open this file to view it.

The choropleth map of regression residuals reveals varying model performance across NYC boroughs, with the Bronx showing higher actual incident rates than predicted (red), Manhattan aligning closely with predictions (light orange), and Queens/Brooklyn showing moderate deviations (orange). These patterns suggest that while our education-based regression model captures some patterns in illegal pet ownership, it may not account for other important factors affecting incident rates, particularly in the Bronx. Limitations of this visualization include potential oversimplification of spatial patterns due to borough-level aggregation, which masks within-borough variations, and the challenge of interpreting residuals without considering the underlying population density or socioeconomic factors beyond education that might influence illegal pet ownership patterns.